data <- readRDS(here("data/NFIdata/NFIdata_cleaned.rds"))

Goal: estimate the association between the genomic offset and the mortality rates in natural populations.
flowchart LR
    C[Competition among trees] --> M[Mortality rates]
    DBH[Average tree age] --> M
    DBH <--> C
    GO[Genomic offset] --> M
    Country --> M
    CS[Census interval] --> M
    S[Spatial autocorrelation] --> C
    S --> GO
    S --> M
    S --> DBH
    style S fill:#D5F5E3,stroke:#2ECC71
    style M fill:#F5B7B1,stroke:#E74C3C
The mortality rates in the NFI plots may be influenced by:
Competition among trees. To account for it, we use as a proxy the variable \(C_{i}\), the basal area of all trees (all species combined) in plot \(i\).
Average tree age. To account for it, we use as a proxy the variable \(DBH_{i}\), the mean diameter at breast height (DBH) of all maritime pines in plot \(i\), including adults, dead trees and seedlings/saplings (i.e. recruitment).
Predicted genomic offset at the location of the plot \(i\). Proxy of the potential maladaptation experienced by the trees at this location.
Country. The French and Spanish inventories present noticeable methodological differences that may bias the estimates, so we may want to estimate country-specific coefficients: the country-specific intercepts \(\beta_{0,c}\) and the country-specific slopes \(\beta_{C,c}\), \(\beta_{GO,c}\) and \(\beta_{BA,c}\). This is what I did during my PhD.
Census interval, which we account for with an offset and a complementary log-log link function (see justification below).
Spatial autocorrelation among plots. To account for it, we may use Gaussian processes, following section 14.5 of Statistical Rethinking by Richard McElreath.
During my PhD, I also included the potential influence of fires on mortality rates. For that, I used as proxy the variable \(BA_{i}\), which is the estimated monthly burned area from the GFED4 database at the location of the plot \(i\). Note that mortality events due to fires should not be recorded in the NFI. However, high mortality rates are observed in Galicia (see exploratory analyses in report 8_ValidationNFI) and one potential explanation may be the higher fire-related mortality in this region. In the following analyses though, I decided not to incorporate fires as a potential cause of mortality rates, but this is something we can discuss!
Mathematical model
In order to account for the different census intervals between inventories, we modeled the proportion \(p_{i}\) of maritime pines that died in plot \(i\) during the census interval with the complementary log-log link and an offset on the logarithm of the census interval \(\Delta_{i}\) of plot \(i\), as follows:
\[\begin{align*} m_i &\sim \text{Binomial}(N_i,p_i)\\ \text{log}(-\text{log}(1-p_i)) &= X\beta + \text{log}(\Delta_i) \\ \end{align*}\]
with \(N_{i}\) the total number of maritime pines in plot \(i\), \(m_{i}\) the number of maritime pines that died during the census interval \(\Delta_{i}\) in plot \(i\), \(X\) the design matrix (a column of 1s for the intercept plus the explanatory variables), and \(\beta\) the vector of associated coefficients.
The complementary log-log (cloglog) link:
\[\eta = \mathcal{C}(\mu) = \log (- \log (1 - \mu ))\]
The inverse of the complementary log-log link:
\[ \mu = \mathcal{C}^{-1}(\eta) = 1 - \exp (- \exp ( \eta )) \]
with \(\mu\) the probability of mortality during a given exposure period, and \(\eta\) the linear predictor.
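As a quick sanity check, we can verify numerically in R that these two functions are inverses of each other (a minimal sketch; the helper names `cloglog` and `inv_cloglog` are introduced here and are not part of the analysis code):

```r
# Complementary log-log link and its inverse (helper names are ours)
cloglog <- function(mu) log(-log(1 - mu))
inv_cloglog <- function(eta) 1 - exp(-exp(eta))

mu <- seq(0.01, 0.99, by = 0.01)
max(abs(inv_cloglog(cloglog(mu)) - mu))  # numerically zero
```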
In our case, we know: \[ \Pr(\text{survival during } \Delta t) = \Pr(\text{annual survival})^{\Delta t} \]
With \(\Delta t\) equal to the exposure time.
Probability of survival during \(\Delta t\) is equal to \(1 - \mu_{t}\), with \(\mu_{t}\) equal to the probability of mortality during \(\Delta t\).
Annual probability of survival is equal to \(1 - \mu_{0}\), with \(\mu_{0}\) equal to the annual probability of mortality.
So we know: \[\begin{align*} 1 - \mu_{t} &= ( 1 - \mu_{0})^{\Delta t}\\ \mu_{t} &= 1 - ( 1 - \mu_{0})^{\Delta t} \end{align*}\]
We want to find which offset \(\lambda_{t}\) (i.e. the term included to take into account the different exposure times between trees) we have to include in the model so that it predicts the annual mortality probabilities rather than the mortality probabilities over census intervals of different lengths. For that, we model the probability of mortality (and not the probability of survival), with 0 corresponding to alive trees and 1 to dead trees. \(\mu_{t}\) (the probability of mortality during \(\Delta t\)) is related to \(\eta + \lambda_{t}\): \[ \eta + \lambda_{t} = \mathcal{C}(\mu_{t}) = \log (- \log (1 - \mu_{t})) \]
\(\mu_{0}\) (the annual probability of mortality) is related to \(\eta\). \[ \eta = \mathcal{C}(\mu_{0}) = \log (- \log (1 - \mu_{0})) \tag*{(3)} \]
Let’s prove that: \[ \lambda_{t} = \log(\Delta t) \]
We know that (see the inverse of the complementary log-log link): \[\mu_{t} = \mathcal{C}^{-1}(\eta + \lambda_{t}) = 1 - \exp (- \exp ( \eta + \lambda_{t})) \tag*{(1)} \]
And we know that: \[\mu_{t} = 1 - ( 1 - \mu_{0})^{\Delta t} \tag*{(2)} \]
Let’s merge (1) + (2) \[\begin{align*} 1 - \exp (- \exp ( \eta + \lambda_{t})) &= 1 - ( 1 - \mu_{0})^{\Delta t}\\ \exp (- \exp ( \eta + \lambda_{t})) &= (1 - \mu_{0})^{\Delta t}\\ -\exp( \eta + \lambda_{t}) &= \Delta t \log(1 - \mu_{0})\\ \exp( \eta + \lambda_{t}) &= - \Delta t \log(1 - \mu_{0})\\ \exp(\eta) \exp(\lambda_{t}) &= - \Delta t \log(1 - \mu_{0})\\ \exp(\lambda_{t}) &= - \Delta t \frac{\log(1 - \mu_{0})}{\exp(\eta)}\\ \exp(\lambda_{t}) &= - \Delta t \frac{\log(1 - \mu_{0})}{\exp(\log (- \log (1 - \mu_{0})))} \tag*{see (3)}\\ \exp(\lambda_{t}) &= - \Delta t \frac{\log(1 - \mu_{0})}{- \log (1 - \mu_{0})}\\ \exp(\lambda_{t}) &= \Delta t\\ \lambda_{t} &= \log(\Delta t)\\ \end{align*}\]
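This result can also be checked numerically: applying the inverse cloglog function to \(\eta + \log(\Delta t)\) should reproduce \(\mu_t = 1 - (1 - \mu_0)^{\Delta t}\). A minimal R sketch (the parameter values are arbitrary illustrations):

```r
mu_0 <- 0.02                        # annual mortality probability (arbitrary)
eta  <- log(-log(1 - mu_0))         # linear predictor on the cloglog scale
dt   <- 1:15                        # census intervals in years

mu_t_offset <- 1 - exp(-exp(eta + log(dt)))  # cloglog model with offset log(dt)
mu_t_direct <- 1 - (1 - mu_0)^dt             # survival-based formula
max(abs(mu_t_offset - mu_t_direct))          # numerically zero
```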
We check if the models are able to recover the simulated parameter values (simulated under the assumption that the model is true). To do this, we start with a very simple model and we gradually make it more complex.
\[\begin{align*} m_i &\sim \text{Binomial}(N_i,p_i)\\ \text{log}(-\text{log}(1-p_i)) &= \beta_{0} + \beta_{GO} GO_i \\ \end{align*}\]
with \(\beta_{0}\) the intercept, \(GO_{i}\) the estimated genomic offset in plot \(i\), and \(\beta_{GO}\) its associated slope.
\[\begin{align*} m_i &\sim \text{Binomial}(N_i,p_i)\\ \text{log}(-\text{log}(1-p_i)) &= \beta_{0} + \beta_{GO} GO_i + \text{log}(\Delta_i) \\ \end{align*}\]
with \(\Delta_{i}\) the census interval for the plot \(i\). In this model, \(m_{i}\) is the number of maritime pines that died during the census interval \(\Delta_{i}\).
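As a quick cross-check outside Stan, this kind of model can also be fitted by maximum likelihood with base R's `glm`, using `family = binomial(link = "cloglog")` and the log census interval as an offset (a sketch on simulated data; the parameter values below are arbitrary and not the ones used in the simulations of this report):

```r
set.seed(1)
N        <- 1000
GO       <- rnorm(N)                          # standardized genomic offset
nb_years <- sample(5:15, N, replace = TRUE)   # census intervals (years)
beta_0 <- -6; beta_GO <- 0.5                  # arbitrary "true" values
p       <- 1 - exp(-exp(beta_0 + beta_GO * GO + log(nb_years)))
nb_tot  <- rep(30, N)
nb_dead <- rbinom(N, nb_tot, p)

fit <- glm(cbind(nb_dead, nb_tot - nb_dead) ~ GO,
           family = binomial(link = "cloglog"),
           offset = log(nb_years))
coef(fit)  # estimates should be close to beta_0 = -6 and beta_GO = 0.5
```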
\[\begin{align*} m_i &\sim \text{Binomial}(N_i,p_i)\\ \text{log}(-\text{log}(1-p_i)) &= \beta_{0} + \beta_{GO} GO_i + \beta_C C_i + \text{log}(\Delta_i) \\ \end{align*}\]
with \(C_{i}\) the basal area of all trees (all species combined) in plot \(i\) (to account for the competition between trees) and \(\beta_C\) its associated slope.
\[\begin{align*} m_i &\sim \text{Binomial}(N_i,p_i)\\ \text{log}(-\text{log}(1-p_i)) &= \beta_{0,c} + \beta_{C} C_i + \beta_{GO} GO_i + \text{log}(\Delta_i) \\ \end{align*}\]
with \(\beta_{0,c}\) the country-specific intercept, \(C_{i}\) the basal area of all trees (all species combined) in plot \(i\) (accounting for the competition between trees) with its associated slope \(\beta_{C}\), and \(GO_{i}\) the estimated genomic offset in plot \(i\) with its associated slope \(\beta_{GO}\).
We might want to use country-specific varying intercepts (instead of fixed intercepts), but I thought it would add unnecessary complexity to the model (and there are only two countries to estimate the variance among country-specific intercepts). In the end, I did not include this model in the model comparison.
Non-centered version of the model (with \(z\)-scores) to facilitate convergence.
\[\begin{align*} m_{i} &\sim \text{Binomial}(N_{i},p_{i})\\ \text{log}(-\text{log}(1-p_{i})) &= z_{\beta_{0,c}} \sigma_{\beta_{0}} + z_{\beta_{C,c}} \sigma_{\beta_{C}} C_{i} + z_{\beta_{GO,c}} \sigma_{\beta_{GO}} GO_{i} + \text{log}(\Delta_{i}) \\[4pt] \begin{bmatrix} z_{\beta_{0,c}} \\ z_{\beta_{C,c}} \\ z_{\beta_{GO,c}} \end{bmatrix} & \sim \text{MVNormal}\left(\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix},\mathbf{R} \right) \\[4pt] \begin{bmatrix} \sigma_{\beta_{0}} \\ \sigma_{\beta_{C}}\\ \sigma_{\beta_{GO}} \end{bmatrix} & \sim \text{Exponential}(1) \\[4pt] \mathbf{R}& \sim \text{LKJcorr(4)} \end{align*}\]
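The non-centered trick relies on the fact that \(z \, \sigma\) with \(z \sim \text{Normal}(0,1)\) has the same distribution as a direct draw from \(\text{Normal}(0,\sigma)\); the sampler explores the standardized \(z\) scale, which is usually easier. A minimal R illustration (the value of \(\sigma\) is arbitrary):

```r
set.seed(42)
sigma <- 1.7                              # arbitrary scale parameter
z <- rnorm(1e5, 0, 1)                     # standardized ("non-centered") draws
beta_noncentered <- z * sigma             # rescaled draws
beta_centered    <- rnorm(1e5, 0, sigma)  # direct ("centered") draws

# The two samples have the same distribution (compare their spreads)
c(sd(beta_noncentered), sd(beta_centered))  # both close to 1.7
```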
We will use spatial Gaussian processes with a squared exponential covariance kernel to model the spatial autocorrelation.
Materials used to build this model:
the case study Robust Gaussian Process Modeling by Michael Betancourt.
the tutorial Stan Geostatistical Model by Nicholas Clark.
Statistical Rethinking by Richard McElreath.
We first consider a model with only the cloglog link and the spatial process, to determine whether we are able to accurately estimate the spatial autocorrelation parameters.
\[\begin{align*} m_i &\sim \text{Binomial}(N_i,p_i)\\ \text{log}(-\text{log}(1-p_i)) &= \beta_{0} + \mu_i\\ \end{align*}\]
The intercepts \(\mu_i\) capture the spatial effects. They have a multivariate Gaussian prior such that:
\[\begin{align*} \mu_i & \sim \text{MVNormal}(0,\mathbf{K})\\ K_{ij} &= \alpha^{2} \exp \left( - \frac{ D_{ij}^2 }{2\rho^2} \right) + \delta_{ij} \sigma^2 \end{align*}\]
\(\mathbf{K}\) is the covariance matrix of the \(\mu_i\) intercepts and belongs to the squared exponential family (the exponentiated quadratic covariance function in Stan’s manual); we can also speak of a squared exponential covariance kernel. This function says that the covariance between any two spatial points \(i\) and \(j\) declines exponentially with the squared distance between them.
The parameter \(\rho\) determines the rate of decline, i.e. it controls how quickly the correlations between spatial points decay as a function of the distance between them. With this parameterization, the larger \(\rho\), the more slowly the covariance declines with squared distance. \(\rho\) is often called the length-scale parameter.
\(\alpha^{2}\) is the maximum covariance between any two spatial points \(i\) and \(j\) (Statistical Rethinking, McElreath). The parameter \(\alpha\) controls the marginal variability of the spatial function at all points; in other words it controls how much the spatial term contributes to the linear predictor (Stan Geostatistical Model of Nicholas Clark).
\(\delta_{ij} \sigma^2\) provides for extra covariance beyond \(\alpha^{2}\) when \(i = j\). It does this because the function \(\delta_{ij}\) is equal to 1 when \(i = j\) but is zero otherwise (Statistical Rethinking, McElreath). In our study, this term does not matter because we only have one observation for each NFI plot. But if we had more than one observation per plot, \(\sigma\) would describe how these observations covary.
So, in our study, when \(i=j\), \(K_{ij} = \alpha^{2}\).
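To make these properties concrete, the covariance function can be written as a small R helper (`sq_exp_cov` is a name introduced here; with \(\sigma\) fixed at a negligible constant, the diagonal is essentially \(\alpha^2\) and the off-diagonal covariances decay with distance):

```r
# Squared exponential covariance matrix with a small nugget (helper is ours)
sq_exp_cov <- function(D, alpha, rho, sigma = 1e-6) {
  K <- alpha^2 * exp(-D^2 / (2 * rho^2))  # exponentiated quadratic kernel
  K + diag(sigma^2, nrow(D))              # delta_ij * sigma^2 term
}

D <- as.matrix(dist(c(0, 100, 500, 1500)))  # pairwise distances (km)
K <- sq_exp_cov(D, alpha = 1, rho = 500)
round(K, 3)  # diagonal ~ alpha^2 = 1; covariance shrinks with distance
```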
Priors
In Statistical Rethinking, McElreath defines priors for the squares of \(\rho\), \(\alpha\) and \(\sigma\), and estimates them on that scale because it is computationally easier. As in our study, he doesn’t need \(\sigma\) in his model, so he fixes it at an irrelevant constant. As \(\rho^{2}\) and \(\alpha^{2}\) must be positive, he uses exponential priors for them: \(\rho^{2} \sim \text{Exponential}(0.5)\) and \(\alpha^{2} \sim \text{Exponential}(2)\).
In Stan Geostatistical Model, Nicholas Clark uses \(\rho \sim \text{Normal}(2,2.5)\) (very small values of \(\rho\) are not identifiable, so he describes this as an informative prior) and \(\alpha \sim \text{Normal}(0,1)\).
In our study, the distances (in km) among plots are very large and so we use:
\[\begin{align*} \rho & \sim \text{Normal}(500,200)\\ \alpha & \sim \text{Normal}(0,1) \end{align*}\]
We check what the prior distributions imply for the covariance functions.
plot(NULL, xlab="Distance (in km)" , ylab= "Covariance", xlim=c(0,2150), ylim=c(0,6))
n_curve <- 50
rho <- rnorm(n_curve, 500, 200)
alpha <- rnorm(n_curve, 0, 1)
for ( i in 1:n_curve ) curve(alpha[i]^2*exp(-0.5 * ((x / rho[i])^2)), add=TRUE, col = col.alpha("darkgreen",0.5))

Problem: this model already takes a long time to fit on 200 simulated spatial points, so it would not be possible to fit it on the 11,917 NFI plots. So, I decided not to include spatial autocorrelation in the models. However, I checked that model M3 was able to accurately estimate the association between the mortality rates and the genomic offset predictions on simulated data with spatial autocorrelation. The rationale was that, if model M3 shows reasonable accuracy on simulated data with autocorrelation, we can use it to estimate this association even though it does not account for spatial autocorrelation.
If it had been possible to fit the M5_1 model to the NFI plots, the next steps would have been to incorporate the genomic offset predictions and the basal area into the model, as follows: \[\begin{align*} m_i &\sim \text{Binomial}(N_i,p_i)\\ \text{log}(-\text{log}(1-p_i)) &= \beta_{0,c} + \beta_{C,c} C_i + \beta_{GO,c} GO_i + \text{log}(\Delta_i) + \mu_i\\ \end{align*}\]
However, as I did not manage to fit the M5_2 model to the full dataset, I have not evaluated this model in this report.
We add a third predictor to model M3.
\[\begin{align*} m_i &\sim \text{Binomial}(N_i,p_i)\\ \text{log}(-\text{log}(1-p_i)) &= \beta_{0} + \beta_{GO} GO_i + \beta_C C_i + \beta_{DBH} DBH_i + \text{log}(\Delta_i) \\ \end{align*}\]
with \(DBH_{i}\) the mean diameter at breast height (DBH) of the maritime pines in the plot \(i\) (including adults, dead trees and recruitment trees) and its associated slope \(\beta_{DBH}\).
M6_2 models the interaction between \(C_i\) and \(DBH_{i}\).
\[\begin{align*} m_i &\sim \text{Binomial}(N_i,p_i)\\ \text{log}(-\text{log}(1-p_i)) &= \beta_{0} + \beta_{GO} GO_i + \beta_C C_i + \beta_{DBH} DBH_i + \beta_{C*DBH} C_i DBH_i + \text{log}(\Delta_i) \\ \end{align*}\]
To determine whether the models are able to recover the true parameters (i.e. the simulated parameter values), we perform 100 simulations:
nb_simulations <- 100

# Some functions to show model coefficients and coverage across simulations
make_coeff_table <- function(mod,pars=NULL,true_params){
conf95 <- broom.mixed::tidyMCMC(mod,
pars=pars,
droppars = NULL,
robust = FALSE, # give mean and standard deviation
ess = T, # effective sample size estimates
rhat = T, # Rhat estimates
conf.int = T, conf.level = 0.95 # include 95% credible intervals
) %>%
dplyr::rename(conf95_low=conf.low,
conf95_high=conf.high,
mean=estimate,
std_deviation=std.error)
broom.mixed::tidyMCMC(mod,
pars=pars,
droppars = NULL,
robust = TRUE, # give median and mean absolute deviation (= avg distance btw each data point and the mean)
ess = F, rhat = F,
conf.int = T, conf.level = 0.80 # include 80% credible intervals
) %>%
dplyr::rename(conf80_low=conf.low,
conf80_high=conf.high,
median=estimate,
mean_absolute_deviation=std.error) %>%
inner_join(conf95,by=c("term")) %>%
dplyr::select(term,
mean, mean_absolute_deviation,
median, std_deviation,
conf80_low,conf80_high,
conf95_low,conf95_high,
rhat,ess) %>%
mutate(true_values = true_params) %>%
mutate(conf80_coverage = if_else(conf80_low<=true_values & conf80_high>=true_values, TRUE,FALSE),
conf95_coverage = if_else(conf95_low<=true_values & conf95_high>=true_values, TRUE,FALSE))
}
calc_coverage_across_sims <- function(x){
x %>%
group_by(term) %>%
group_split() %>%
purrr::map(\(x){
data.frame(parameter=unique(x$term),
conf80_coverage=sum(x$conf80_coverage == TRUE),
conf95_coverage=sum(x$conf95_coverage == TRUE))
}) %>%
    bind_rows()
}

Stan code of M0:
# Stan code
stancode = stan_model(here("scripts/StanModels/ValidationNFI_mo.stan"))
print(stancode)

S4 class stanmodel 'anon_model' coded as follows:
data {
int N;
vector[N] GO; // Genomic offset
int nb_dead[N]; // Number of dead trees in the plot
int nb_tot[N]; // Total number of trees in the plot
}
parameters {
real beta_0;
real beta_GO;
}
model {
nb_dead ~ binomial(nb_tot,inv_cloglog(beta_0 + beta_GO * GO));
beta_0 ~ normal(0, 1);
beta_GO ~ normal(0, 1);
}
# Parameters to estimate
params_to_estimate <- c("beta_0","beta_GO")

set.seed(49205)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)
# Experimental design
N <- 1000 # nb of plots
nb_tot <- rep(30,N) # nb of trees per plot
# Data simulation
lapply(nb_setseed,function(seed){
set.seed(seed)
beta_0 <- runif(1,-5.5,-4.5) # intercept
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
GO <- rnorm(N,0,1) # genomic offset values
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset values
eta <- beta_0 + beta_GO * GO # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot
# Stan list
datalist <- list(N=N,
nb_dead=nb_dead,
nb_tot=nb_tot,
GO=GO)
# Running Stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE)
# Build coeff table
make_coeff_table(mod,
pars = params_to_estimate,
true_params = c(beta_0,beta_GO))
}) %>%
bind_rows(.id = "nb_sim") %>%
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m0.rds")))

To compare the model estimates with the true parameter values, the following table shows the coverage of the 80% and 95% credible intervals for the \(\beta_0\) and \(\beta_{GO}\) coefficients.
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m0.rds"))) %>%
calc_coverage_across_sims %>%
  kable_mydf()

| parameter | conf80_coverage | conf95_coverage |
|---|---|---|
| beta_0 | 74 | 96 |
| beta_GO | 84 | 95 |
# Stan code
stancode = stan_model(here("scripts/StanModels/ValidationNFI_m1.stan"))
print(stancode)S4 class stanmodel 'anon_model' coded as follows:
data {
int N;
vector[N] log_nb_years; // Offset to account for different census intervals
vector[N] GO; // Genomic offset
int nb_dead[N]; // Number of dead trees in the plot
int nb_tot[N]; // Total number of trees in the plot
}
parameters {
real beta_0;
real beta_GO;
}
model {
nb_dead ~ binomial(nb_tot,inv_cloglog(beta_0 + beta_GO * GO + log_nb_years));
beta_0 ~ normal(0, 1);
beta_GO ~ normal(0, 1);
}
set.seed(494442)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)
# Experimental design
N <- 1000 # nb of plots
nb_tot <- rep(30,N) # nb of trees per plot
# Data simulation
lapply(nb_setseed,function(seed){
set.seed(seed)
nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)
beta_0 <- runif(1,-6.5,-5.5) # intercept
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
eta <- beta_0 + beta_GO * GO + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot
# Stan list
datalist <- list(N=N,
nb_dead=nb_dead,
nb_tot=nb_tot,
GO=GO,
log_nb_years=log(nb_years))
# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0,save_warmup = FALSE)
# Build coeff table
make_coeff_table(mod,
pars = params_to_estimate,
true_params = c(beta_0,beta_GO))
}) %>%
bind_rows(.id = "nb_sim") %>%
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m1.rds")))

readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m1.rds"))) %>%
calc_coverage_across_sims %>%
  kable_mydf()

| parameter | conf80_coverage | conf95_coverage |
|---|---|---|
| beta_0 | 76 | 94 |
| beta_GO | 82 | 94 |
# Stan code
stancode = stan_model(here("scripts/StanModels/ValidationNFI_m2.stan"))
print(stancode)S4 class stanmodel 'anon_model' coded as follows:
data {
int N;
vector[N] C; // Proxy of the competition among trees (i.e. basal area)
vector[N] log_nb_years; // Offset to account for different census intervals
vector[N] GO; // Genomic offset
int nb_dead[N]; // Number of dead trees in the plot
int nb_tot[N]; // Total number of trees in the plot
}
parameters {
real beta_0;
real beta_GO;
real beta_C;
}
model {
nb_dead ~ binomial(nb_tot,inv_cloglog(beta_0 + beta_GO * GO + beta_C * C + log_nb_years));
beta_0 ~ normal(0, 1);
beta_GO ~ normal(0, 1);
beta_C ~ normal(0, 1);
}
# Parameters to estimate
params_to_estimate <- c("beta_0","beta_GO","beta_C")

set.seed(49205)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)
# Experimental design
N <- 1000 # nb of plots
nb_tot <- rep(30,N) # nb of trees per plot
# Data simulation
lapply(nb_setseed,function(seed){
set.seed(seed)
nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)
beta_0 <- runif(1,-7.5,-6.5) # intercept
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area (capturing the competition among trees)
C <- (C - mean(C)) / sd(C) # scaling the basal area
eta <- beta_0 + beta_GO * GO + beta_C * C + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot
# Stan list
datalist <- list(N=N,
nb_dead=nb_dead,
nb_tot=nb_tot,
GO=GO,
C=C,
log_nb_years=log(nb_years))
# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0,save_warmup = FALSE)
# Build coeff table
make_coeff_table(mod,
pars = params_to_estimate,
true_params = c(beta_0,beta_GO,beta_C))
}) %>%
bind_rows(.id = "nb_sim") %>%
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m2_balanceddesign.rds")))

readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m2_balanceddesign.rds"))) %>%
calc_coverage_across_sims %>%
  kable_mydf()

| parameter | conf80_coverage | conf95_coverage |
|---|---|---|
| beta_0 | 76 | 90 |
| beta_C | 76 | 91 |
| beta_GO | 82 | 96 |
set.seed(420453)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)
# Experimental design
N <- 1000 # nb of plots
# Data simulation
lapply(nb_setseed,function(seed){
set.seed(seed)
nb_tot <-sample(5:30, N, replace=T) # nb of trees per plot
nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)
beta_0 <- runif(1,-7.5,-6.5) # intercept
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
eta <- beta_0 + beta_GO * GO + beta_C * C + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot
# Stan list
datalist <- list(N=N,
nb_dead=nb_dead,
nb_tot=nb_tot,
GO=GO,
C=C,
log_nb_years=log(nb_years))
# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0,save_warmup = FALSE)
# Build coeff table
make_coeff_table(mod,
pars = params_to_estimate,
true_params = c(beta_0,beta_GO,beta_C))
}) %>%
bind_rows(.id = "nb_sim") %>%
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m2_unbalanceddesign.rds")))

readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m2_unbalanceddesign.rds"))) %>%
calc_coverage_across_sims %>%
  kable_mydf()

| parameter | conf80_coverage | conf95_coverage |
|---|---|---|
| beta_0 | 71 | 86 |
| beta_C | 75 | 91 |
| beta_GO | 82 | 94 |
set.seed(4453)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)
# Experimental design
N <- length(data$plotcode) # same nb of plots as true data
nb_tot <- data$nb_tot # same nb of trees per plot as true data
nb_years <- data$nb_years # same time periods between inventory dates
# Data simulation
lapply(nb_setseed,function(seed){
set.seed(seed)
beta_0 <- runif(1,-7.5,-6.5) # intercept
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area (capturing the competition among trees)
C <- (C - mean(C)) / sd(C) # scaling the basal area
eta <- beta_0 + beta_GO * GO + beta_C * C + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot
# Stan list
datalist <- list(N=N,
nb_dead=nb_dead,
nb_tot=nb_tot,
GO=GO,
C=C,
log_nb_years=log(nb_years))
# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE)
# Build coeff table
make_coeff_table(mod,
pars = params_to_estimate,
true_params = c(beta_0,beta_GO,beta_C))
}) %>%
bind_rows(.id = "nb_sim") %>%
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m2_truedesign.rds")))

readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m2_truedesign.rds"))) %>%
calc_coverage_across_sims %>%
  kable_mydf()

| parameter | conf80_coverage | conf95_coverage |
|---|---|---|
| beta_0 | 74 | 94 |
| beta_C | 80 | 94 |
| beta_GO | 78 | 96 |
# Stan code
stancode = stan_model(here("scripts/StanModels/ValidationNFI_m3.stan"))
print(stancode)S4 class stanmodel 'anon_model' coded as follows:
data {
int N;
vector[N] log_nb_years; // Offset to account for different census intervals
vector[N] C; // Proxy of the competition among trees (i.e. basal area)
vector[N] GO; // Genomic offset
int nb_dead[N]; // Number of dead trees in the plot
int nb_tot[N]; // Total number of trees in the plot
int<lower=0> nb_country; // Number of countries
int<lower=0, upper=nb_country> country[N]; // Countries
}
parameters {
vector[nb_country] alpha_country;
real beta_GO;
real beta_C;
}
model {
nb_dead ~ binomial(nb_tot,inv_cloglog(alpha_country[country] + beta_GO * GO + beta_C * C + log_nb_years));
alpha_country ~ normal(0, 1);
beta_GO ~ normal(0, 1);
beta_C ~ normal(0, 1);
}
# Parameters to estimate
params_to_estimate <- c("alpha_country","beta_GO","beta_C")

set.seed(4945424)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)
# Experimental design
N <- 1000 # nb of plots
nb_tot <- rep(30,N) # nb of trees per plot
nb_country <- 2
country <- c(rep(1,N/2),rep(2,N/2))
# Data simulation
lapply(nb_setseed,function(seed){
set.seed(seed)
nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)
alpha_country <- c(runif(1,-7.5,-6.5),runif(1,-6.5,-5.5)) # country-specific intercepts
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
eta <- alpha_country[country] + beta_GO * GO + beta_C * C + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot
# Stan list
datalist <- list(N=N,
nb_dead=nb_dead,
nb_tot=nb_tot,
GO=GO,
C=C,
nb_country=nb_country,
country=country,
log_nb_years=log(nb_years))
# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE)
# Build coeff table
make_coeff_table(mod,
pars=params_to_estimate,
true_params = c(alpha_country,beta_GO,beta_C))
}) %>%
bind_rows(.id = "nb_sim") %>%
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m3_balanceddesign.rds")))

readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m3_balanceddesign.rds"))) %>%
calc_coverage_across_sims %>%
  kable_mydf()

| parameter | conf80_coverage | conf95_coverage |
|---|---|---|
| alpha_country[1] | 66 | 86 |
| alpha_country[2] | 76 | 90 |
| beta_C | 79 | 92 |
| beta_GO | 83 | 96 |
set.seed(53)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)
# Experimental design
N <- 1000 # nb of plots
nb_country <- 2 # nb of countries
country <- c(rep(1,N/2),rep(2,N/2))
# Data simulation
lapply(nb_setseed,function(seed){
set.seed(seed)
nb_tot <-sample(5:30, N, replace=T) # nb of trees per plot
nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)
alpha_country <- c(runif(1,-7.5,-6.5),runif(1,-6.5,-5.5)) # country-specific intercepts
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
eta <- alpha_country[country] + beta_GO * GO + beta_C * C + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot
# Stan list
datalist <- list(N=N,
nb_dead=nb_dead,
nb_tot=nb_tot,
GO=GO,
C=C,
nb_country=nb_country,
country=country,
log_nb_years=log(nb_years))
# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE)
# Build coeff table
make_coeff_table(mod,
pars=params_to_estimate,
true_params = c(alpha_country,beta_GO,beta_C))
}) %>%
bind_rows(.id = "nb_sim") %>%
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m3_unbalanceddesign.rds")))

readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m3_unbalanceddesign.rds"))) %>%
calc_coverage_across_sims %>%
  kable_mydf()

| parameter | conf80_coverage | conf95_coverage |
|---|---|---|
| alpha_country[1] | 64 | 88 |
| alpha_country[2] | 67 | 86 |
| beta_C | 71 | 88 |
| beta_GO | 79 | 94 |
set.seed(49205)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)
# Experimental design
N <- length(data$plotcode) # same nb of plots as true data
nb_tot <- data$nb_tot # same nb of trees per plot as true data
nb_years <- data$nb_years # same time periods between inventory dates
# Data simulation
lapply(nb_setseed,function(seed){
set.seed(seed)
nb_country <- length(unique(data$country))
country <- as.numeric(data$country)
alpha_country <- runif(2,min=-8,max=-5) # country-specific intercepts
beta_GO <- runif(1,-0.7,0.7) # coeff of the genomic offset
beta_C <- runif(1,-1.3,1.3) # coeff of the basal area
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
eta <- alpha_country[country] + beta_GO * GO + beta_C * C + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot
# Stan list
datalist <- list(N=N,
nb_dead=nb_dead,
nb_tot=nb_tot,
GO=GO,
C=C,
nb_country=nb_country,
country=country,
log_nb_years=log(nb_years))
# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE)
# Build coeff table
make_coeff_table(mod,
pars = params_to_estimate,
true_params = c(alpha_country,beta_GO,beta_C))
}) %>%
bind_rows(.id = "nb_sim") %>%
  saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m3_truedesign.rds")))

readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m3_truedesign.rds"))) %>%
calc_coverage_across_sims %>%
kable_mydf()
| parameter | conf80_coverage | conf95_coverage |
|---|---|---|
| alpha_country[1] | 80 | 93 |
| alpha_country[2] | 74 | 92 |
| beta_C | 83 | 97 |
| beta_GO | 82 | 99 |
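Throughout these simulations, the linear predictor includes the offset `log(nb_years)` and the mortality probability is obtained through the inverse complementary log-log link, `p <- 1-exp(-exp(eta))`. This combination amounts to assuming a constant annual mortality hazard: survival over a census interval of \(t\) years equals annual survival raised to the power \(t\). A quick numerical check (the intercept and interval length below are arbitrary illustrative values):

```r
# With eta = alpha + log(t) and p = 1 - exp(-exp(eta)) (inverse cloglog),
# the census-interval survival 1 - p equals annual survival to the power t.
alpha <- -6   # arbitrary intercept on the cloglog scale
t     <- 12   # arbitrary census interval (years)
p_interval <- 1 - exp(-exp(alpha + log(t)))
p_annual   <- 1 - exp(-exp(alpha))
all.equal(1 - p_interval, (1 - p_annual)^t) # TRUE
```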
In this set of simulations, we use the NFI data but replace the genomic offset predictions at the location of the NFI plots with a randomly generated variable such that \(x \sim \mathcal{N}(0,1)\).
run_randomGO <- function(stancode,data){
random_GO <- rnorm(nrow(data),mean=0,sd=1)
datalist <- list(N=nrow(data),
nb_dead=data$nb_dead,
nb_tot=data$nb_tot,
GO=(random_GO - mean(random_GO)) / sd(random_GO),
C=(data$basal_area-mean(data$basal_area))/sd(data$basal_area),
nb_country=length(unique(data$country)),
country=as.numeric(data$country),
log_nb_years=log(data$nb_years))
# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE)
# Model coefficients
conf95 <- broom.mixed::tidyMCMC(mod,
pars=params_to_estimate,
droppars = NULL,
robust = FALSE, # give mean and standard deviation
ess = T, # effective sample size estimates
rhat = T, # Rhat estimates
conf.int = T, conf.level = 0.95 # include 95% credible intervals
) %>%
dplyr::rename(conf95_low=conf.low,
conf95_high=conf.high,
mean=estimate,
std_deviation=std.error)
broom.mixed::tidyMCMC(mod,
pars=params_to_estimate,
droppars = NULL,
robust = TRUE, # give median and mean absolute deviation (= avg distance btw each data point and the mean)
ess = F, rhat = F,
conf.int = T, conf.level = 0.80 # include 80% credible intervals
) %>%
dplyr::rename(conf80_low=conf.low,
conf80_high=conf.high,
median=estimate,
mean_absolute_deviation=std.error) %>%
inner_join(conf95,by=c("term")) %>%
dplyr::select(term,
mean, mean_absolute_deviation,
median, std_deviation,
conf80_low,conf80_high,
conf95_low,conf95_high,
rhat,ess)
}
set.seed(495)
lapply(1:nb_simulations, function(x) run_randomGO(stancode,data)) %>% saveRDS(here(paste0("outputs/ValidationNFI/Simulations/m3_truedata_randomGO_",nb_simulations,"simulations.rds")))
# Function to count the number of simulations in which the 95% and 80% credible intervals overlap with zero.
count_overlapp_with_zero_across_sims <- function(x){
x %>%
group_by(term) %>%
group_split() %>%
purrr::map(\(x){
data.frame(parameter=unique(x$term),
conf80_include_zero=sum(x$conf80_include_zero == TRUE),
conf95_include_zero=sum(x$conf95_include_zero == TRUE))
}) %>%
    bind_rows()
}
# Extract the outputs and identify the simulations in which the 95% and 80% credible intervals overlap with zero.
df <- readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/m3_truedata_randomGO_",nb_simulations,"simulations.rds"))) %>%
bind_rows(.id="sim_ID") %>%
dplyr::filter(term == "beta_GO") %>%
mutate(conf80_include_zero = if_else(conf80_low<=0 & conf80_high>=0, TRUE,FALSE),
conf95_include_zero = if_else(conf95_low<=0 & conf95_high>=0, TRUE,FALSE))
# How many times the 95% and 80% credible intervals include zero?
df %>%
count_overlapp_with_zero_across_sims %>%
kable_mydf()
| parameter | conf80_include_zero | conf95_include_zero |
|---|---|---|
| beta_GO | 36 | 44 |
# Show the mean and 95% / 80% credible intervals with interval plots
df %>%
ggplot(aes(mean, reorder(sim_ID, mean))) +
geom_vline(xintercept = 0, color="gray30") +
geom_errorbar(aes(xmin = conf95_low, xmax = conf95_high), color="#009E73") +
geom_errorbar(aes(xmin = conf80_low, xmax = conf80_high), color="orange") +
geom_point() +
theme_bw() +
labs(x="Effect size of a randomly generated variable", y="ID of the simulations") +
theme(axis.text.y = element_text(size=6))
The code below is mostly based on the Stan Geostatistical Model tutorial by Nicholas Clark.
We first generate some fake spatial points within the spatial range of the NFI plots.
# Generating coordinates of spatial points based on the geographical range of the NFI plots
# =========================================================================================
N <- 200 # nb of plots
lat <- runif(n = N, min = min(data$latitude), max = max(data$latitude))
lon <- runif(n = N, min = min(data$longitude), max = max(data$longitude))
coords <- as.matrix(cbind(lon, lat))
colnames(coords) <- c('Longitude', 'Latitude')
We simulate 200 geospatial coordinates within the geographic area of the NFI plots: latitudes and longitudes are drawn uniformly between the minimum and maximum latitudes and longitudes of the NFI plots.
We then calculate distances (in meters) between the points, ensure the resulting matrix is symmetric, and then convert distances to kilometers.
m <- pointDistance(coords, lonlat = TRUE) # calculate distances (in meters) between the points
m[upper.tri(m)] = t(m)[upper.tri(m)] # ensure the resulting matrix is symmetric
m <- m / 1000 # convert distances to kilometers
m[1:10,1:10] # preview the matrix
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.0000 1187.7291 1066.1934 1335.6785 532.3548 1387.7242 951.8294 957.3336 1249.5280 1073.8229
[2,] 1187.7291 0.0000 423.3596 157.5926 1336.2816 1129.5277 1319.3939 986.3943 1041.3719 360.9065
[3,] 1066.1934 423.3596 0.0000 550.7623 1378.6763 1468.1351 1503.1758 1225.1371 1358.6983 700.8530
[4,] 1335.6785 157.5926 550.7623 0.0000 1452.0022 1126.0241 1391.1026 1043.6973 1054.5010 407.1606
[5,] 532.3548 1336.2816 1378.6763 1452.0022 0.0000 1058.7853 479.3696 652.2654 937.1620 1082.7107
[6,] 1387.7242 1129.5277 1468.1351 1126.0241 1058.7853 0.0000 643.4014 432.3474 139.4959 777.6579
[7,] 951.8294 1319.3939 1503.1758 1391.1026 479.3696 643.4014 0.0000 360.3180 551.8977 984.2593
[8,] 957.3336 986.3943 1225.1371 1043.6973 652.2654 432.3474 360.3180 0.0000 297.2924 637.2568
[9,] 1249.5280 1041.3719 1358.6983 1054.5010 937.1620 139.4959 551.8977 297.2924 0.0000 682.4744
[10,] 1073.8229 360.9065 700.8530 407.1606 1082.7107 777.6579 984.2593 637.2568 682.4744 0.0000
# Another way of doing it, which
# gives the same distance matrix:
# library(geodist)
# t <- coords %>% geodist(measure="geodesic")
# t[1:10,1:10]
In this matrix, the maximum distance is 2045 km.
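`pointDistance(lonlat = TRUE)` and `geodist(measure = "geodesic")` use accurate geodesic formulas; as a rough cross-check (not part of the analysis), a plain spherical haversine should agree with them to within roughly 0.5%:

```r
# Hedged sketch: haversine great-circle distance on a spherical Earth.
# Geodesic functions on the ellipsoid will differ slightly from this.
haversine_km <- function(lon1, lat1, lon2, lat2, R = 6371){
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat/2)^2 + cos(lat1*to_rad) * cos(lat2*to_rad) * sin(dlon/2)^2
  2 * R * asin(sqrt(a))  # distance in km
}
haversine_km(0, 0, 0, 1) # ~111 km: one degree of latitude
```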
We then assign values to the parameters \(\rho\) and \(\alpha\) to ensure that there is obvious spatial autocorrelation.
alpha <- 1 # maximum covariance
rho <- 500 # spatial decay of 500 km
We calculate the squared exponential covariance matrix, adding a small offset to the diagonal to ensure the resulting matrix is positive definite, a requirement for simulating multivariate normal draws.
K <- alpha^2 * exp(-0.5 * ((m / rho) ^ 2)) + diag(1e-9, N)
K[1:10,1:10]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1.00000000 0.05952250 0.10294743 0.02821013 0.56733619 0.02124663 0.16333396 0.15993701 0.04404072 0.09964003
[2,] 0.05952250 1.00000000 0.69874685 0.95154257 0.02811935 0.07795172 0.03075815 0.14285181 0.11430204 0.77066081
[3,] 0.10294743 0.69874685 1.00000000 0.54515880 0.02233732 0.01342225 0.01089910 0.04969148 0.02491906 0.37441528
[4,] 0.02821013 0.95154257 0.54515880 1.00000000 0.01474840 0.07919354 0.02085142 0.11319900 0.10818152 0.71780346
[5,] 0.56733619 0.02811935 0.02233732 0.01474840 1.00000000 0.10624022 0.63154230 0.42703027 0.17264028 0.09589276
[6,] 0.02124663 0.07795172 0.01342225 0.07919354 0.10624022 1.00000000 0.43695249 0.68808117 0.96182939 0.29834532
[7,] 0.16333396 0.03075815 0.01089910 0.02085142 0.63154230 0.43695249 1.00000000 0.77131521 0.54379540 0.14405892
[8,] 0.15993701 0.14285181 0.04969148 0.11319900 0.42703027 0.68808117 0.77131521 1.00000000 0.83797628 0.44388376
[9,] 0.04404072 0.11430204 0.02491906 0.10818152 0.17264028 0.96182939 0.54379540 0.83797628 1.00000000 0.39394556
[10,] 0.09964003 0.77066081 0.37441528 0.71780346 0.09589276 0.29834532 0.14405892 0.44388376 0.39394556 1.00000000
Let’s visualize the shape of the function relating distance to the covariance \(K_{ij}\):
curve(alpha^2 * exp(-0.5 * ((x / rho) ^ 2)),
from=0, to=2150,
lty=2,
xlab="Distance (in km)" ,
ylab="Covariance")
We simulate the latent Gaussian process as random draws from a zero-centered multivariate normal distribution with \(K\) as the covariance matrix. Let’s visualize the spatial autocorrelation:
set.seed(4924)
mu <- MASS::mvrnorm(1, mu = rep(0, N), Sigma = K)
ggplot(data.frame(GP = mu,
Longitude = lon,
Latitude = lat),
aes(x = Longitude, y = Latitude, fill = GP)) +
geom_point(size = 5, shape = 21, col = 'white') +
scale_fill_continuous(type = "viridis") +
theme_minimal()
We then simulate the observed counts of dead trees.
set.seed(492)
nb_tot <- rep(30,N) # nb of trees per plot
beta_0 <- -6 # intercept
eta <- beta_0 + mu
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N, nb_tot, p) # nb of dead trees in each plot
ggplot(data.frame(Y = nb_dead,
Longitude = lon,
Latitude = lat),
aes(x = Longitude, y = Latitude, fill = Y)) +
geom_point(size = 5, shape = 21, col = 'white') +
scale_fill_continuous(type = "viridis", name = "nb dead trees") +
theme_minimal()
# Stan code
stancode = stan_model(here("scripts/StanModels/ValidationNFI_m5_1_noncentered.stan"))
print(stancode)
S4 class stanmodel 'anon_model' coded as follows:
functions{
matrix cov_GPL2(matrix x, real alpha, real rho, real delta) {
real sq_alpha = square(alpha);
real sq_rho = square(rho);
int N = dims(x)[1];
matrix[N, N] K;
K = sq_alpha * exp(-0.5 *(square(x)/sq_rho)) + diag_matrix(rep_vector(delta, N));
return cholesky_decompose(K);
}
}
data {
int N;
int nb_dead[N]; // Number of dead trees in the plot
int nb_tot[N]; // Total number of trees in the plot
matrix[N, N] Dmat; // Distance matrix (in km)
}
transformed data {
real delta = 1e-9; // Small offset to ensure the covariance matrix is positive definite
}
parameters {
vector[N] eta;
real beta_0;
real<lower=0> alpha;
real<lower=0> rho;
}
transformed parameters {
matrix[N, N] L_K;
vector[N] mu;
L_K = cov_GPL2(Dmat, alpha, rho, delta);
mu = L_K * eta;
}
model {
rho ~ normal(500, 200);
alpha ~ normal(0,1);
beta_0 ~ normal(-4,2);
eta ~ std_normal(); // Multiplier for non-centred GP parameterisation
nb_dead ~ binomial(nb_tot,inv_cloglog(beta_0 + mu));
}
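The `transformed parameters` block implements the non-centred parameterisation: instead of sampling `mu` directly from a multivariate normal with covariance `K`, the model samples independent standard normals `eta` and maps them through the Cholesky factor of `K`. A minimal sketch of the same trick in R (toy values, not the NFI data):

```r
set.seed(1)
N <- 5
m_toy <- as.matrix(dist(runif(N, 0, 1000)))   # toy distance matrix (km)
K_toy <- 1^2 * exp(-0.5 * (m_toy / 500)^2) + diag(1e-9, N)
L <- t(chol(K_toy))  # lower Cholesky factor, as cov_GPL2() returns in the Stan model
z <- rnorm(N)        # plays the role of eta ~ std_normal()
mu_toy <- as.vector(L %*% z)
# L %*% t(L) reconstructs K_toy, so mu_toy ~ MVN(0, K_toy): the same
# distribution as MASS::mvrnorm(1, rep(0, N), K_toy)
stopifnot(all.equal(L %*% t(L), K_toy, check.attributes = FALSE))
```

Note that R's `chol()` returns the upper triangular factor, hence the transpose to match Stan's `cholesky_decompose`, which returns the lower one.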
# Stan list
datalist <- list(N=N,
nb_dead=nb_dead,
nb_tot=nb_tot,
Dmat=m)
# Running stan model
mod <- sampling(stancode, data = datalist,
iter = 2000,
chains = 4, cores = 4,
init=0,
save_warmup = FALSE,
pars=c("mu","beta_0","alpha","rho"))
mod %>% saveRDS(file=here("outputs/ValidationNFI/Simulations/m5_1.rds"))
mod <- readRDS(file=here("outputs/ValidationNFI/Simulations/m5_1.rds"))
# Model coefficients
broom.mixed::tidyMCMC(mod,
pars=c("beta_0","alpha","rho"),
droppars = NULL,
conf.level=0.95,
estimate.method = "median",
ess = T, rhat = T, conf.int = T) %>%
kable_mydf(round_number=3)
| term | estimate | std.error | conf.low | conf.high | rhat | ess |
|---|---|---|---|---|---|---|
| beta_0 | -5.617 | 0.593 | -7.059 | -4.687 | 1.000 | 1537 |
| alpha | 0.923 | 0.458 | 0.145 | 1.979 | 1.003 | 1284 |
| rho | 429.562 | 162.166 | 165.230 | 797.061 | 1.005 | 1503 |
make_coeff_table(mod, pars=c("beta_0","alpha","rho"), true_params = c(beta_0,alpha,rho)) %>%
kable_mydf()
| term | mean | mean_absolute_deviation | median | std_deviation | conf80_low | conf80_high | conf95_low | conf95_high | rhat | ess | true_values | conf80_coverage | conf95_coverage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| beta_0 | -5.62 | 0.53 | -5.69 | 0.59 | -6.47 | -5.04 | -7.06 | -4.69 | 1 | 1537 | -6 | TRUE | TRUE |
| alpha | 0.92 | 0.43 | 0.95 | 0.46 | 0.39 | 1.56 | 0.14 | 1.98 | 1 | 1284 | 1 | TRUE | TRUE |
| rho | 429.56 | 159.09 | 444.00 | 162.17 | 243.65 | 657.37 | 165.23 | 797.06 | 1 | 1503 | 500 | TRUE | TRUE |
Let’s look at the scatter plots for \(\rho\), \(\alpha\) and \(\beta_0\).
mcmc_pairs(as.array(mod),
np = nuts_params(mod),
pars = c("beta_0","alpha","rho"),
off_diag_args = list(size = 0.75))
Using code from the Stan Geostatistical Model tutorial by Nicholas Clark, we visualize the posterior estimates of the spatial covariance kernel to see whether the model captures the simulated covariance well.
# We define a function to compute the covariance kernel for each posterior draw
quad_kernel <- function(rho, alpha, distances){
  alpha^2 * exp(-0.5 * ((distances / rho)^2))
}
# We specify the distances (in km) over which to compute the posterior kernel estimates
distances = seq(0, 2150, length.out = 75)
# we compute posterior kernels
kernels <- matrix(NA, nrow = 1000, ncol = length(distances))
list_posteriors <- rstan::extract(mod, pars=c("rho","alpha"))
for(i in 1:1000){
kernels[i,] <- quad_kernel(rho = list_posteriors$rho[i],
alpha = list_posteriors$alpha[i],
distances = distances)
}
# Calculate posterior empirical quantiles for the kernel
probs <- c(0.05, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.95)
cred <- sapply(1:NCOL(kernels),
function(n) quantile(kernels[,n],
probs = probs))
c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")
# Plot the estimated kernels and overlay the true simulated kernel in black
pred_vals <- distances
plot(1, xlim = c(0, 2150), ylim = c(0, max(cred)), type = 'n',
xlab = 'Distance (in km)', ylab = 'Covariance',
bty = 'L')
box(bty = 'L', lwd = 2)
polygon(c(pred_vals, rev(pred_vals)), c(cred[1,], rev(cred[9,])),
col = c_light, border = NA)
polygon(c(pred_vals, rev(pred_vals)), c(cred[2,], rev(cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(pred_vals, rev(pred_vals)), c(cred[3,], rev(cred[7,])),
col = c_mid, border = NA)
polygon(c(pred_vals, rev(pred_vals)), c(cred[4,], rev(cred[6,])),
col = c_mid_highlight, border = NA)
lines(pred_vals, cred[5,], col = c_dark, lwd = 2.5)
# Overlay the true simulated kernel
true_kernel <- quad_kernel(rho = rho,
alpha = alpha,
distances = distances)
lines(pred_vals,
true_kernel, lwd = 3.5, col = 'white')
lines(pred_vals,
true_kernel, lwd = 3, col = 'black')
As I was not able to fit model M5_1 to a large dataset, I checked whether model M3 could recover the true parameter values from simulated data that included spatial autocorrelation among plots.
Here is a sample of the covariance functions that are plausible in the simulations, given the chosen distributions for \(\rho\) and \(\alpha\):
\[\rho \sim \text{Uniform}(50,600)\] \[\alpha \sim \text{Uniform}(0.2,1.5)\]
plot(NULL, xlab="Distance (in km)" , ylab= "Covariance", xlim=c(0,2150), ylim=c(0,3))
n_curve <- 100
rho <- runif(n_curve, 50, 600)
alpha <- runif(n_curve, 0.2, 1.5)
for ( i in 1:n_curve ) curve(alpha[i]^2*exp(-0.5 * ((x / rho[i])^2)) , add=T, col = col.alpha("darkgreen",0.5))
# Stan code of model M3
stancode = stan_model(here("scripts/StanModels/ValidationNFI_m3.stan"))
nb_setseed <- sample(1:1000, nb_simulations, replace=FALSE)
# 100 simulations
lapply(nb_setseed,function(seed){
set.seed(seed)
N <- 5000 # number of plots
data <- sample_n(data,N,replace=F) # sample N plots among the true NFI plots
nb_tot <- data$nb_tot # same nb of trees per plot as true data
nb_years <- data$nb_years # same time periods between inventory dates
nb_country <- length(unique(data$country))
country <- as.numeric(data$country)
# Distance matrix among plots (in km)
m <- data %>%
dplyr::select(latitude,longitude) %>%
geodist(measure="geodesic")
m <- m / 1000 # convert distances to kilometers
# Covariance matrix
rho <- runif(1, 50, 600)
alpha <- runif(1, 0.2, 1.5)
K <- alpha^2 * exp(-0.5 * ((m / rho) ^ 2)) + diag(1e-9, N)
mu <- MASS::mvrnorm(1, mu = rep(0, N), Sigma = K)
ggplot(data.frame(GP = mu,
Longitude = data$longitude,
Latitude = data$latitude),
aes(x = Longitude, y = Latitude, fill = GP)) +
geom_point(size = 2, shape = 21, col = 'white') +
scale_fill_continuous(type = "viridis") +
theme_minimal()
alpha_country <- runif(2,min=-8,max=-5) # country-specific intercepts
beta_GO <- runif(1,-0.7,0.7) # coeff of the genomic offset
beta_C <- runif(1,-1.3,1.3) # coeff of the basal area
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
eta <- alpha_country[country] + beta_GO * GO + beta_C * C + log(nb_years) + mu # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot
ggplot(data.frame(Y = nb_dead,
Longitude = data$longitude,
Latitude = data$latitude),
aes(x = Longitude, y = Latitude, fill = Y)) +
geom_point(size = 3, shape = 21, col = 'white') +
scale_fill_continuous(type = "viridis", name = "nb dead trees") +
theme_minimal()
# Stan list
datalist <- list(N=N,
nb_dead=nb_dead,
nb_tot=nb_tot,
GO=GO,
C=C,
nb_country=nb_country,
country=country,
log_nb_years=log(nb_years))
# Running stan model
mod <- sampling(stancode,
data = datalist,
iter = 2000, chains = 4, cores = 4,init=0,
save_warmup = FALSE)
# Build coeff table
make_coeff_table(mod,
pars=c("alpha_country","beta_GO","beta_C"),
true_params = c(alpha_country,beta_GO,beta_C))
}) %>%
bind_rows(.id = "nb_sim") %>%
saveRDS(file=here("outputs/ValidationNFI/Simulations/sim_m3_withspatialautocorrelation.rds"))
The following table shows the coverage of the 80% and 95% credible intervals for the country-specific intercepts and the \(\beta_C\) and \(\beta_{GO}\) coefficients from model M3 fitted on simulated data with spatial autocorrelation:
readRDS(file=here("outputs/ValidationNFI/Simulations/sim_m3_withspatialautocorrelation.rds")) %>%
calc_coverage_across_sims %>%
kable_mydf()
| parameter | conf80_coverage | conf95_coverage |
|---|---|---|
| alpha_country[1] | 7 | 13 |
| alpha_country[2] | 26 | 37 |
| beta_C | 63 | 76 |
| beta_GO | 67 | 88 |
# Stan code
stancode = stan_model(here("scripts/StanModels/ValidationNFI_m6_1.stan"))
print(stancode)
S4 class stanmodel 'anon_model' coded as follows:
data {
int N;
vector[N] log_nb_years; // Offset to account for different census intervals
vector[N] C; // Proxy of the competition among trees (i.e. basal area)
vector[N] DBH; // Proxy of the average tree age (i.e. mean DBH)
vector[N] GO; // Genomic offset
int nb_dead[N]; // Number of dead trees in the plot
int nb_tot[N]; // Total number of trees in the plot
int<lower=0> nb_country; // Number of countries
int<lower=0, upper=nb_country> country[N]; // Countries
}
parameters {
vector[nb_country] alpha_country;
real beta_GO;
real beta_C;
real beta_DBH;
}
model {
nb_dead ~ binomial(nb_tot,inv_cloglog(alpha_country[country] + beta_GO * GO + beta_C * C + beta_DBH * DBH + log_nb_years));
alpha_country ~ normal(0, 1);
beta_GO ~ normal(0, 1);
beta_C ~ normal(0, 1);
beta_DBH ~ normal(0, 1);
}
# Parameters to estimate
params_to_estimate <- c("alpha_country","beta_GO","beta_C", "beta_DBH")
set.seed(4922424)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)
# Experimental design
N <- 1000 # nb of plots
nb_tot <- rep(30,N) # nb of trees per plot
nb_country <- 2
country <- c(rep(1,N/2),rep(2,N/2))
# Data simulation
lapply(nb_setseed,function(seed){
set.seed(seed)
nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)
alpha_country <- c(runif(1,-7.5,-6.5),runif(1,-6.5,-5.5)) # country-specific intercepts
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area
beta_DBH <- runif(1,-1.5,1.5) # coeff of the mean DBH
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
DBH <- rnorm(N,0,1) # mean DBH
DBH <- (DBH - mean(DBH)) / sd(DBH) # scaling the mean DBH
eta <- alpha_country[country] + beta_GO * GO + beta_C * C + beta_DBH * DBH + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot
# Stan list
datalist <- list(N=N,
nb_dead=nb_dead,
nb_tot=nb_tot,
GO=GO,
C=C,
DBH=DBH,
nb_country=nb_country,
country=country,
log_nb_years=log(nb_years))
# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE)
# Build coeff table
make_coeff_table(mod,
pars = params_to_estimate,
true_params = c(alpha_country,beta_GO,beta_C,beta_DBH))
}) %>%
bind_rows(.id = "nb_sim") %>%
saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_1_balanceddesign.rds")))
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_1_balanceddesign.rds"))) %>%
calc_coverage_across_sims %>%
kable_mydf()
| parameter | conf80_coverage | conf95_coverage |
|---|---|---|
| alpha_country[1] | 75 | 95 |
| alpha_country[2] | 77 | 93 |
| beta_C | 77 | 91 |
| beta_DBH | 77 | 96 |
| beta_GO | 85 | 95 |
set.seed(53)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)
# Experimental design
N <- 1000 # nb of plots
nb_country <- 2 # nb of countries
country <- c(rep(1,N/2),rep(2,N/2))
# Data simulation
lapply(nb_setseed,function(seed){
set.seed(seed)
nb_tot <- sample(5:30, N, replace=T) # nb of trees per plot
nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)
alpha_country <- c(runif(1,-7.5,-6.5),runif(1,-6.5,-5.5)) # country-specific intercepts
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area
beta_DBH <- runif(1,-1.5,1.5) # coeff of the mean DBH
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
DBH <- rnorm(N,0,1) # mean DBH
DBH <- (DBH - mean(DBH)) / sd(DBH) # scaling the mean DBH
eta <- alpha_country[country] + beta_GO * GO + beta_C * C + beta_DBH * DBH + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot
# Stan list
datalist <- list(N=N,
nb_dead=nb_dead,
nb_tot=nb_tot,
GO=GO,
C=C,
DBH=DBH,
nb_country=nb_country,
country=country,
log_nb_years=log(nb_years))
# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE)
# Build coeff table
make_coeff_table(mod,
pars = params_to_estimate,
true_params = c(alpha_country,beta_GO,beta_C,beta_DBH))
}) %>%
bind_rows(.id = "nb_sim") %>%
saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_1_unbalanceddesign.rds")))
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_1_unbalanceddesign.rds"))) %>%
calc_coverage_across_sims %>%
kable_mydf()
| parameter | conf80_coverage | conf95_coverage |
|---|---|---|
| alpha_country[1] | 66 | 87 |
| alpha_country[2] | 80 | 92 |
| beta_C | 81 | 94 |
| beta_DBH | 73 | 90 |
| beta_GO | 75 | 93 |
set.seed(49205)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)
# Data simulation
lapply(nb_setseed, function(seed){
set.seed(seed)
N <- length(data$plotcode) # same nb of plots as true data
nb_tot <- data$nb_tot # same nb of trees per plot as true data
nb_years <- data$nb_years # same time periods between inventory dates
nb_country <- length(unique(data$country))
country <- as.numeric(data$country)
alpha_country <- runif(2,min=-8,max=-5) # country-specific intercepts
beta_GO <- runif(1,-0.7,0.7) # coeff of the genomic offset
beta_C <- runif(1,-1.3,1.3) # coeff of the basal area
beta_DBH <- runif(1,-1.5,1.5) # coeff of the mean DBH
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
DBH <- rnorm(N,0,1) # mean DBH
DBH <- (DBH - mean(DBH)) / sd(DBH) # scaling the mean DBH
eta <- alpha_country[country] + beta_GO * GO + beta_C * C + beta_DBH * DBH + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot
# Stan list
datalist <- list(N=N,
nb_dead=nb_dead,
nb_tot=nb_tot,
GO=GO,
C=C,
DBH=DBH,
nb_country=nb_country,
country=country,
log_nb_years=log(nb_years))
# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE)
# Build coeff table
make_coeff_table(mod,
pars = params_to_estimate,
true_params = c(alpha_country,beta_GO,beta_C,beta_DBH))
}) %>%
bind_rows(.id = "nb_sim") %>%
saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_1_truedesign.rds")))
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_1_truedesign.rds"))) %>%
calc_coverage_across_sims %>%
kable_mydf()
| parameter | conf80_coverage | conf95_coverage |
|---|---|---|
| alpha_country[1] | 84 | 95 |
| alpha_country[2] | 78 | 93 |
| beta_C | 78 | 98 |
| beta_DBH | 78 | 97 |
| beta_GO | 81 | 97 |
# Stan code
stancode = stan_model(here("scripts/StanModels/ValidationNFI_m6_2.stan"))
print(stancode)
S4 class stanmodel 'anon_model' coded as follows:
data {
int N;
vector[N] log_nb_years; // Offset to account for different census intervals
vector[N] C; // Proxy of the competition among trees (i.e. basal area)
vector[N] DBH; // Proxy of the average tree age (i.e. mean DBH)
vector[N] GO; // Genomic offset
int nb_dead[N]; // Number of dead trees in the plot
int nb_tot[N]; // Total number of trees in the plot
int<lower=0> nb_country; // Number of countries
int<lower=0, upper=nb_country> country[N]; // Countries
}
parameters {
vector[nb_country] alpha_country;
real beta_GO;
real beta_C;
real beta_DBH;
real beta_C_DBH;
}
model {
nb_dead ~ binomial(nb_tot,inv_cloglog(alpha_country[country] + beta_GO * GO + beta_C * C + beta_DBH * DBH + beta_C_DBH * C .* DBH + log_nb_years));
alpha_country ~ normal(0, 1);
beta_GO ~ normal(0, 1);
beta_C ~ normal(0, 1);
beta_DBH ~ normal(0, 1);
beta_C_DBH ~ normal(0, 1);
}
# Parameters to estimate
params_to_estimate <- c("alpha_country","beta_GO","beta_C", "beta_DBH","beta_C_DBH")
set.seed(49294)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)
# Experimental design
N <- 1000 # nb of plots
nb_tot <- rep(30,N) # nb of trees per plot
nb_country <- 2
country <- c(rep(1,N/2),rep(2,N/2))
# Data simulation
lapply(nb_setseed,function(seed){
set.seed(seed)
nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)
alpha_country <- c(runif(1,-7.5,-6.5),runif(1,-6.5,-5.5)) # country-specific intercepts
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area
beta_DBH <- runif(1,-1.5,1.5) # coeff of the mean DBH
beta_C_DBH <- runif(1,-1.5,1.5) # coeff of the interaction between mean DBH and basal area
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
DBH <- rnorm(N,0,1) # mean DBH
DBH <- (DBH - mean(DBH)) / sd(DBH) # scaling the mean DBH
eta <- alpha_country[country] + beta_GO * GO + beta_C * C + beta_DBH * DBH + beta_C_DBH * C * DBH + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot
# Stan list
datalist <- list(N=N,
nb_dead=nb_dead,
nb_tot=nb_tot,
GO=GO,
C=C,
DBH=DBH,
nb_country=nb_country,
country=country,
log_nb_years=log(nb_years))
# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE)
# Build coeff table
make_coeff_table(mod,
pars = params_to_estimate,
true_params = c(alpha_country,beta_GO,beta_C,beta_DBH,beta_C_DBH))
}) %>%
bind_rows(.id = "nb_sim") %>%
saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_2_balanceddesign.rds")))
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_2_balanceddesign.rds"))) %>%
calc_coverage_across_sims %>%
kable_mydf()
| parameter | conf80_coverage | conf95_coverage |
|---|---|---|
| alpha_country[1] | 75 | 90 |
| alpha_country[2] | 77 | 90 |
| beta_C | 68 | 88 |
| beta_C_DBH | 76 | 94 |
| beta_DBH | 75 | 89 |
| beta_GO | 77 | 92 |
set.seed(53)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)
# Experimental design
N <- 1000 # nb of plots
nb_country <- 2 # nb of countries
country <- c(rep(1,N/2),rep(2,N/2))
# Data simulation
lapply(nb_setseed,function(seed){
set.seed(seed)
nb_tot <- sample(5:30, N, replace=T) # nb of trees per plot
nb_years <- sample(5:15, N, replace=T) # random time periods between inventory dates (between 5 and 15 years)
alpha_country <- c(runif(1,-7.5,-6.5),runif(1,-6.5,-5.5)) # country-specific intercepts
beta_GO <- runif(1,0.25,0.75) # coeff of the genomic offset
beta_C <- runif(1,0.5,1.5) # coeff of the basal area
beta_DBH <- runif(1,-1.5,1.5) # coeff of the mean DBH
beta_C_DBH <- runif(1,-1.5,1.5) # coeff of the interaction between mean DBH and basal area
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
DBH <- rnorm(N,0,1) # mean DBH
DBH <- (DBH - mean(DBH)) / sd(DBH) # scaling the mean DBH
eta <- alpha_country[country] + beta_GO * GO + beta_C * C + beta_DBH * DBH + beta_C_DBH * C * DBH + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot
# Stan list
datalist <- list(N=N,
nb_dead=nb_dead,
nb_tot=nb_tot,
GO=GO,
C=C,
DBH=DBH,
nb_country=nb_country,
country=country,
log_nb_years=log(nb_years))
# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE)
# Build coeff table
make_coeff_table(mod,
pars = params_to_estimate,
true_params = c(alpha_country,beta_GO,beta_C,beta_DBH,beta_C_DBH))
}) %>%
bind_rows(.id = "nb_sim") %>%
saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_2_unbalanceddesign.rds")))
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_2_unbalanceddesign.rds"))) %>%
calc_coverage_across_sims %>%
kable_mydf()
| parameter | conf80_coverage | conf95_coverage |
|---|---|---|
| alpha_country[1] | 68 | 87 |
| alpha_country[2] | 67 | 93 |
| beta_C | 77 | 95 |
| beta_C_DBH | 74 | 94 |
| beta_DBH | 69 | 91 |
| beta_GO | 72 | 91 |
set.seed(49205)
nb_setseed <- sample(1:1000,nb_simulations,replace=FALSE)
# Data simulation
lapply(nb_setseed, function(seed){
set.seed(seed)
N <- length(data$plotcode) # same nb of plots as true data
nb_tot <- data$nb_tot # same nb of trees per plot as true data
nb_years <- data$nb_years # same time periods between inventory dates
nb_country <- length(unique(data$country))
country <- as.numeric(data$country)
alpha_country <- runif(2,min=-8,max=-5) # country-specific intercepts
beta_GO <- runif(1,-0.7,0.7) # coeff of the genomic offset
beta_C <- runif(1,-1.3,1.3) # coeff of the basal area
beta_DBH <- runif(1,-1.5,1.5) # coeff of the mean DBH
beta_C_DBH <- runif(1,-1.5,1.5) # coeff of the interaction between mean DBH and basal area
GO <- rnorm(N,0,1) # genomic offset
GO <- (GO - mean(GO)) / sd(GO) # scaling the genomic offset
C <- rnorm(N,0,1) # basal area
C <- (C - mean(C)) / sd(C) # scaling the basal area
DBH <- rnorm(N,0,1) # mean DBH
DBH <- (DBH - mean(DBH)) / sd(DBH) # scaling the mean DBH
eta <- alpha_country[country] + beta_GO * GO + beta_C * C + beta_DBH * DBH + beta_C_DBH * C * DBH + log(nb_years) # linear predictor
p <- 1-exp(-exp(eta)) # probability of mortality in each plot
nb_dead <- rbinom(N,nb_tot, p) # nb of dead trees in each plot
# Stan list
datalist <- list(N=N,
nb_dead=nb_dead,
nb_tot=nb_tot,
GO=GO,
C=C,
DBH=DBH,
nb_country=nb_country,
country=country,
log_nb_years=log(nb_years))
# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE)
# Build coeff table
make_coeff_table(mod,
pars = params_to_estimate,
true_params = c(alpha_country,beta_GO,beta_C,beta_DBH,beta_C_DBH))
}) %>%
bind_rows(.id = "nb_sim") %>%
saveRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_2_truedesign.rds")))
readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/",nb_simulations,"simulations_m6_2_truedesign.rds"))) %>%
calc_coverage_across_sims %>%
kable_mydf()
| parameter | conf80_coverage | conf95_coverage |
|---|---|---|
| alpha_country[1] | 83 | 95 |
| alpha_country[2] | 72 | 91 |
| beta_C | 79 | 95 |
| beta_C_DBH | 80 | 99 |
| beta_DBH | 78 | 94 |
| beta_GO | 87 | 98 |
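A note on the complementary log-log link used in the simulations above: with the offset \(\log(nb\_years)\) added to the linear predictor, the interval-level mortality probability \(p = 1-\exp(-\exp(\eta + \log t))\) equals \(1-(1-p_{annual})^t\), i.e. the offset compounds a constant annual hazard over the whole census interval. A quick standalone check, using values within the simulated ranges:

```r
# Check that the log(nb_years) offset under the cloglog link compounds an
# annual mortality probability over the census interval.
eta <- -6.5                                # within the simulated intercept range
t <- 7                                     # hypothetical census interval (years)
p_annual <- 1 - exp(-exp(eta))             # 1-year mortality probability
p_interval <- 1 - exp(-exp(eta + log(t)))  # probability with the offset
all.equal(p_interval, 1 - (1 - p_annual)^t) # TRUE: identical by construction
```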
In this set of simulations, we use the NFI data but replace the genomic offset predictions at the location of the NFI plots with a randomly generated variable \(x \sim \mathcal{N}(0,1)\).
run_randomGO <- function(stancode,data){
random_GO <- rnorm(nrow(data),mean=0,sd=1)
datalist <- list(N=nrow(data),
nb_dead=data$nb_dead,
nb_tot=data$nb_tot,
GO=(random_GO - mean(random_GO)) / sd(random_GO),
C=(data$basal_area-mean(data$basal_area))/sd(data$basal_area),
DBH=(data$mean_DBH - mean(data$mean_DBH))/sd(data$mean_DBH),
nb_country=length(unique(data$country)),
country=as.numeric(data$country),
log_nb_years=log(data$nb_years))
# Running stan model
mod <- sampling(stancode, data = datalist, iter = 2000, chains = 4, cores = 4,init=0, save_warmup = FALSE)
# Model coefficients
conf95 <- broom.mixed::tidyMCMC(mod,
pars=params_to_estimate,
droppars = NULL,
robust = FALSE, # give mean and standard deviation
ess = T, # effective sample size estimates
rhat = T, # Rhat estimates
conf.int = T, conf.level = 0.95 # include 95% credible intervals
) %>%
dplyr::rename(conf95_low=conf.low,
conf95_high=conf.high,
mean=estimate,
std_deviation=std.error)
broom.mixed::tidyMCMC(mod,
pars=params_to_estimate,
droppars = NULL,
robust = TRUE, # give median and median absolute deviation (MAD, a robust spread estimate)
ess = F, rhat = F,
conf.int = T, conf.level = 0.80 # include 80% credible intervals
) %>%
dplyr::rename(conf80_low=conf.low,
conf80_high=conf.high,
median=estimate,
median_absolute_deviation=std.error) %>%
inner_join(conf95,by=c("term")) %>%
dplyr::select(term,
mean, std_deviation,
median, median_absolute_deviation,
conf80_low,conf80_high,
conf95_low,conf95_high,
rhat,ess)
}
set.seed(4935)
lapply(1:nb_simulations, function(x) run_randomGO(stancode,data)) %>%
saveRDS(here(paste0("outputs/ValidationNFI/Simulations/m6_2_truedata_randomGO_",nb_simulations,"simulations.rds")))
# Function to count the number of simulations in which the 95% and 80% credible intervals overlap with zero.
count_overlapp_with_zero_across_sims <- function(x){
x %>%
group_by(term) %>%
group_split() %>%
purrr::map(\(x){
data.frame(parameter=unique(x$term),
conf80_include_zero=sum(x$conf80_include_zero == TRUE),
conf95_include_zero=sum(x$conf95_include_zero == TRUE))
}) %>%
bind_rows()
}
# Extract the outputs and identify the simulations in which the 95% and 80% credible intervals overlap with zero.
df <- readRDS(file=here(paste0("outputs/ValidationNFI/Simulations/m6_2_truedata_randomGO_",nb_simulations,"simulations.rds"))) %>%
bind_rows(.id="sim_ID") %>%
dplyr::filter(term == "beta_GO") %>%
mutate(conf80_include_zero = if_else(conf80_low<=0 & conf80_high>=0, TRUE,FALSE),
conf95_include_zero = if_else(conf95_low<=0 & conf95_high>=0, TRUE,FALSE))
# How many times do the 95% and 80% credible intervals include zero?
df %>%
count_overlapp_with_zero_across_sims %>%
kable_mydf()
| parameter | conf80_include_zero | conf95_include_zero |
|---|---|---|
| beta_GO | 34 | 51 |
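If the model were well calibrated under this null, the 80% and 95% credible intervals should include zero in roughly 80% and 95% of simulations. An exact binomial test can quantify the discrepancy; the total number of simulations below is a placeholder (`nb_simulations` is set earlier in the report), so the p-values are illustrative only:

```r
# Compare observed include-zero counts to the nominal rates expected
# under a true null effect of the genomic offset.
# n_sims is a PLACEHOLDER; substitute the report's nb_simulations.
n_sims <- 100
p80 <- binom.test(34, n_sims, p = 0.80)$p.value # 80% CIs including zero
p95 <- binom.test(51, n_sims, p = 0.95)$p.value # 95% CIs including zero
c(p80 = p80, p95 = p95)
```

Counts well below the nominal rates would mean the model often attributes an apparent effect to a variable that is pure noise, which could reflect unmodelled structure such as the spatial autocorrelation discussed above.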
# Show the mean and 95% / 80% credible intervals with interval plots
df %>%
ggplot(aes(mean, reorder(sim_ID, mean))) +
geom_vline(xintercept = 0, color="gray30") +
geom_errorbar(aes(xmin = conf95_low, xmax = conf95_high), color="#009E73") +
geom_errorbar(aes(xmin = conf80_low, xmax = conf80_high), color="orange") +
geom_point() +
theme_bw() +
labs(x="Effect size of a randomly generated variable", y="ID of the simulations") +
theme(axis.text.y = element_text(size=6))